10 research outputs found
ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations
As generative AI becomes more prevalent, it is important to study how human
users interact with such models. In this work, we investigate how people use
text-to-image models to generate desired target images. To study this
interaction, we created ArtWhisperer, an online game where users are given a
target image and are tasked with iteratively finding a prompt that generates
an image similar to the target. Through this game, we recorded over 50,000
human-AI interactions; each interaction corresponds to one text prompt created
by a user and the corresponding generated image. The majority of these are
repeated interactions where a user iterates to find the best prompt for their
target image, making this a unique sequential dataset for studying human-AI
collaborations. In an initial analysis of this dataset, we identify several
characteristics of prompt interactions and user strategies. People submit
diverse prompts and are able to discover a variety of text descriptions that
generate similar images. Interestingly, prompt diversity does not decrease as
users find better prompts. We further propose a new metric to quantify the
steerability of AI using our dataset. We define steerability as the expected
number of interactions required to adequately complete a task. We estimate this
value by fitting a Markov chain for each target task and calculating the
expected time to reach an adequate score in the Markov chain. We quantify and
compare AI steerability across different types of target images and two
different models, finding that images of cities and natural world images are
more steerable than artistic and fantasy images. These findings provide
insights into human-AI interaction behavior, present a concrete method of
assessing AI steerability, and demonstrate the general utility of the
ArtWhisperer dataset.
Comment: 26 pages, 20 figures
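The steerability metric described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: scores are discretized into states, a Markov chain is fit from observed interaction sequences, and the expected number of steps to first reach the "adequate" state is computed by solving the standard hitting-time linear system. Function names and the toy state space are assumptions for illustration.

```python
import numpy as np

def expected_steps_to_adequate(sequences, n_states, adequate_state):
    # Count empirical transitions between discretized score states.
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a, b] += 1
    # Row-normalize to a transition matrix (uniform fallback for unseen states).
    row_sums = counts.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_states)
    # Treat the adequate state as absorbing and solve (I - Q) t = 1 for the
    # expected hitting time t from each transient state.
    transient = [s for s in range(n_states) if s != adequate_state]
    Q = P[np.ix_(transient, transient)]
    t = np.linalg.solve(np.eye(len(transient)) - Q, np.ones(len(transient)))
    times = np.zeros(n_states)
    times[transient] = t
    return times  # times[s] = expected interactions starting from state s

# Toy example: 3 score states (0 = poor, 1 = fair, 2 = adequate).
seqs = [[0, 0, 1, 2], [0, 1, 1, 2], [1, 2]]
times = expected_steps_to_adequate(seqs, n_states=3, adequate_state=2)
```

A more steerable task is one whose fitted chain yields smaller expected hitting times from the typical starting states.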
Harmless interpolation of noisy data in regression
A continuing mystery in understanding the empirical success of deep neural
networks has been their ability to achieve zero training error and yet
generalize well, even when the training data is noisy and there are more
parameters than data points. We investigate this "overparameterization"
phenomenon in the classical underdetermined linear regression problem, where all
solutions that minimize training error interpolate the data, including noise.
We give a bound on how well such interpolative solutions can generalize to
fresh test data, and show that this bound generically decays to zero with the
number of extra features, thus characterizing an explicit benefit of
overparameterization. For appropriately sparse linear models, we provide a
hybrid interpolating scheme (combining classical sparse recovery schemes with
harmless noise-fitting) to achieve generalization error close to the bound on
interpolative solutions.
Comment: 17 pages, presented at ITA in San Diego in Feb 201
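The basic object of study above can be made concrete with a toy simulation. This is an illustrative sketch, not the paper's construction: in an overparameterized linear model, the minimum l2-norm solution (the pseudoinverse solution) fits the noisy training labels exactly, i.e. it interpolates the noise. The dimensions, noise level, and sparse ground truth below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                              # many more features (d) than samples (n)
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[0] = 1.0                             # sparse ground-truth signal
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy labels

# Minimum-norm interpolator among all solutions of X w = y: w_hat = X^+ y.
w_hat = np.linalg.pinv(X) @ y

# Zero training error: the noise itself is interpolated exactly.
train_resid = np.max(np.abs(X @ w_hat - y))
```

The question the abstract addresses is how much this forced noise-fitting hurts on fresh test points, and why the harm can vanish as d grows.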
TrueImage: A Machine Learning Algorithm to Improve the Quality of Telehealth Photos
Telehealth is an increasingly critical component of the health care
ecosystem, especially due to the COVID-19 pandemic. Rapid adoption of
telehealth has exposed limitations in the existing infrastructure. In this
paper, we study and highlight photo quality as a major challenge in the
telehealth workflow. We focus on teledermatology, where photo quality is
particularly important; the framework proposed here can be generalized to other
health domains. For telemedicine, dermatologists request that patients submit
images of their lesions for assessment. However, these images are often of
insufficient quality to make a clinical diagnosis since patients do not have
experience taking clinical photos. A clinician has to manually triage poor
quality images and request new images to be submitted, leading to wasted time
for both the clinician and the patient. We propose an automated image
assessment machine learning pipeline, TrueImage, to detect poor quality
dermatology photos and to guide patients in taking better photos. Our
experiments indicate that TrueImage can reject 50% of the sub-par quality
images, while retaining 80% of good quality images patients send in, despite
heterogeneity and limitations in the training data. These promising results
suggest that our solution is feasible and can improve the quality of
teledermatology care.
Comment: 12 pages, 5 figures, preprint of an article published in Pacific
Symposium on Biocomputing © 2020 World Scientific Publishing Co., Singapore,
http://psb.stanford.edu
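One plausible component of a photo-quality pipeline like the one described is a blur check. The abstract does not specify TrueImage's internals, so the following is an assumed, minimal sketch of a classic heuristic: the variance of the Laplacian, which is high for sharp images (strong edges) and low for blurry or flat ones. The threshold value is a placeholder, not a number from the paper.

```python
import numpy as np

# 3x3 Laplacian kernel for edge response.
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def laplacian_variance(gray):
    # Valid 3x3 correlation of the grayscale image with the Laplacian kernel,
    # implemented with shifted slices to avoid external dependencies.
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += LAPLACIAN[i, j] * gray[i:i + h - 2, j:j + w - 2]
    return out.var()

def is_blurry(gray, threshold=100.0):
    # threshold is an assumed placeholder; in practice it would be tuned
    # on annotated clinical photos.
    return laplacian_variance(gray) < threshold

# Sharp checkerboard vs. a flat (maximally "blurred") image.
sharp = (np.indices((32, 32)).sum(axis=0) % 2) * 255.0
flat = np.full((32, 32), 128.0)
```

A deployed system would combine several such checks (blur, lighting, framing) and, as described above, turn failures into concrete retake guidance for the patient.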
Development and Clinical Evaluation of an AI Support Tool for Improving Telemedicine Photo Quality
Telemedicine utilization was accelerated during the COVID-19 pandemic, and
skin conditions were a common use case. However, the quality of photographs
sent by patients remains a major limitation. To address this issue, we
developed TrueImage 2.0, an artificial intelligence (AI) model for assessing
patient photo quality for telemedicine and providing real-time feedback to
patients for photo quality improvement. TrueImage 2.0 was trained on 1700
telemedicine images annotated by clinicians for photo quality. On a
retrospective dataset of 357 telemedicine images, TrueImage 2.0 effectively
identified poor-quality images (area under the receiver operating
characteristic curve, ROC-AUC = 0.78) and the reason for poor quality (blurry:
ROC-AUC = 0.84; lighting issues: ROC-AUC = 0.70). Performance is consistent
across age, gender, and
skin tone. Next, we assessed whether patient-TrueImage 2.0 interaction led to
an improvement in submitted photo quality through a prospective clinical pilot
study with 98 patients. TrueImage 2.0 reduced the number of patients with a
poor-quality image by 68.0%.
Comment: 24 pages, 7 figures
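For readers unfamiliar with the metric reported above: ROC-AUC equals the probability that a randomly chosen positive example (here, a genuinely poor-quality image) receives a higher model score than a randomly chosen negative one, with ties counting half. A small self-contained sketch of that rank interpretation (the function name is illustrative; in practice a library routine would be used):

```python
def roc_auc(labels, scores):
    # Probability that a positive outranks a negative, ties counted as 0.5.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two positives, two negatives; one negative outranks one positive.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])  # 3 of 4 pairs correct
```

An ROC-AUC of 0.78 thus means a poor-quality image outscores a good one in roughly 78% of such pairs.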
Uncalibrated Models Can Improve Human-AI Collaboration
In many practical applications of AI, an AI model is used as a decision aid
for human users. The AI provides advice that a human (sometimes) incorporates
into their decision-making process. The AI advice is often presented with some
measure of "confidence" that the human can use to calibrate how much they
depend on or trust the advice. In this paper, we demonstrate that human-AI
performance can be improved by calibrating this confidence to the humans using
the advice. In practice, this means presenting calibrated AI models as more or
less confident than they actually are. We show empirically that this can
improve human-AI performance (measured as the accuracy and confidence of the
human's final prediction after seeing the AI advice). We first train a model to
predict human incorporation of AI advice using data from thousands of human
interactions. This enables us to explicitly estimate how to transform the AI's
prediction confidence, making the AI uncalibrated, in order to improve the
final human prediction. We empirically validate our results across four
different tasks--dealing with images, text and tabular data--involving hundreds
of human participants. We further support our findings with simulation
analysis. Our findings suggest the importance of and a framework for jointly
optimizing the human-AI system in contrast to the standard paradigm of
optimizing the AI model alone.
Comment: 19 pages, 10 figures, in submission
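The core operation described above, deliberately re-scaling a calibrated confidence before showing it to the human, can be sketched with a simple monotone transform in log-odds space. This is a hypothetical illustration of the idea, not the paper's fitted transform: gamma > 1 exaggerates confidence, gamma < 1 mutes it, and in the framework above gamma itself would be chosen using the learned model of how humans incorporate advice.

```python
import math

def transform_confidence(c, gamma):
    # Re-scale a probability c in log-odds space: logit -> gamma * logit.
    # gamma = 1 leaves the calibrated confidence unchanged.
    logit = math.log(c / (1.0 - c))
    return 1.0 / (1.0 + math.exp(-gamma * logit))

# A calibrated 0.7 shown as a more confident ~0.84 (gamma assumed = 2).
shown = transform_confidence(0.7, 2.0)
```

The paper's point is that the gamma (or a richer transform) maximizing final human-AI accuracy is generally not the identity, so the displayed confidence is intentionally uncalibrated.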